knitr::opts_chunk$set(echo = TRUE)

1 Introduction to this Project

This project provides a complete run-down and protocol for RNA-Seq and other HTS-data analytics starting with programmatic sourcing of read data and metadata, cleaning and mapping read data with subread, souring reference genome and associated annotation data from Ensembl, counting mapped reads by features with featureCounts in R, and a full analytic, inferential and visualization workflow for bulk RNA-seq data with an eye towards comparing the performance, quality of fit, uncertainty in predicted estimates, power to predict DEGs of alternative DEG-prediction pipelines edgeR and limma-voom.

1.1 Practical and Biological Background and Motivation for This Project

In order to make this project doable in a reasonable time on a personal computer with limited storage, memory, network bandwidth and compute power, we will only download a subset of the read data from an already known example with full results to compare to, and compare our results from under-sampled data for power to detect DEGs

We will analyze data from one of the edgeR User Guide’s Case Study Examples, taken from Section 4.4 “RNA-Seq profiles of mouse mammary gland” of the EdgeR User’s Guide, last revised 26 October 2024. The following description of the study is modified from the User’s Guide:

The RNA-Seq data of this case study is described in Fu, N.Y. et al. (2015). EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival. Nature Cell Biology 17:365 (note that genes are written in italic face). The sequence and count data are publicly available from the Gene Expression Omnibus (GEO) at the series accession number GSE60450. This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates. This is summarized in the table below, where the basal and luminal cell types are abbreviated with B and L respectively.

According to the methods section of this paper:

RNA-seq analysis. Total RNA was extracted from sorted luminal or basal populations from the mammary glands of virgin, 18.5-day-pregnant and 2-day-lactating FVB/N female mice (two independent samples per stage). Total RNA (100 ng) was used to generate libraries for whole-transcriptome analysis following Illumina’s TruSeq RNA v2 sample preparation protocol. Libraries were sequenced on an Illumina HiSeq 2000 at the Australian Genome Research Facility (AGRF), Melbourne. At least 20 million 100 bp single-end reads were obtained for each sample. Reads were aligned to the mouse genome mm10 using Rsubread version 13.25 (ref. 51). The number of reads overlapping each Entrez gene was counted using RefSeq gene annotation and featureCounts. Filtering and normalization used the edgeR package. Genes were filtered as unexpressed if their average count per million (CPM) computed by the aveLogCPM function was less than one. Compositional differences between libraries were normalized using the trimmed mean of M-values (TMM) method. Differential expression was analysed used the Limma package. Counts were transformed to log2-CPM values with associated precision weights using voom. Differential expression was assessed using the TREAT method, whereby differential expression is evaluated relative to a biologically meaningful fold-change threshold. Genes were considered to be differentially expressed if they achieved a false discovery rate of 1% in exceeding a fold-change of 1.5. Gene ontology analysis used the goseq package, which corrects the analysis for gene length bias. Log-CPM values shown in heat maps were computed using the edgeR package with a prior.count of 3 to reduce variability at low counts.

2 Setting up Project Directory

We organized the project directory to ensure all files and outputs were structured for efficient analysis. The folder contained subdirectories for the analysis script, figures, downloaded data (reads and reference genomes), and related annotation data. This setup facilitated smooth progression through the RNA-Seq workflow.

2.1 Downloaded Metadata and Read Data

We began by obtaining metadata and read data from the NCBI GEO database. Specifically, we accessed the GEO accession GSE60450 via the SRA Run Selector tool. From this portal, we downloaded the “Accession List” (a list of SRR accession numbers) and the “Metadata” file. These files were essential for linking sample information to the raw sequencing data. The metadata file was downloaded as SraRunTable.csv, while the accession list was saved as SRR_Acc_List.txt.

For downloading the raw reads, we utilized the UNIX-based SRA Toolkit. This tool allowed us to programmatically fetch data corresponding to the accession numbers. Using a subset of reads ensured the workflow remained manageable on personal computing resources while preserving sufficient data for meaningful analysis.

2.2 Cleaned and Processed Metadata

Once the metadata was downloaded, we cleaned and reshaped it into a format compatible with our analysis pipeline. Specifically, we created a targets.txt file, which summarized the key attributes of each sample. This step involved sorting and extracting information about cell type, biological status, and GEO accession numbers from the SraRunTable.csv. The resulting file provided a structured view of the experimental design, serving as input for downstream analyses.

2.3 Extracted Sequence Statistics

To assess the sequence libraries, we queried statistics such as the number of reads for each sample. This step ensured that the downloaded data aligned with our expectations and experimental requirements. The statistics were extracted programmatically from the SRR accession list, with outputs stored in a summary file. This information confirmed the integrity and consistency of the dataset.

2.4 Preprocessing and Quality Control

2.4.1 Downloaded and Trimmed Reads

We downloaded a subset of reads from each sample to reduce computational overhead while retaining representative data. The read files were processed to remove technical artifacts such as adapter sequences and low-quality bases. This was achieved using the fastp tool, which performed trimming and cleaning operations. For each processed file, we generated detailed HTML and JSON reports summarizing the trimming statistics.

2.4.2 Assessed Quality of Reads

To evaluate the quality of the cleaned reads, we used FastQC. This tool provided visual summaries of sequence quality metrics, such as per-base quality scores, duplication levels, and GC content. The outputs were reviewed to identify any potential issues with the sequencing data.

2.4.3 Aggregated Quality Reports

Finally, we used MultiQC to aggregate the quality control results from fastp and FastQC. This step produced a comprehensive HTML report summarizing the quality metrics across all samples. The report highlighted any inconsistencies or areas requiring further attention, ensuring the data was ready for mapping and downstream analysis.

2.5 Downloaded and Prepared the Reference Genome

To map reads, we downloaded the masked primary assembly of the mouse reference genome (GRCm39) from Ensembl, along with its associated annotation file in GTF format. These files provided the genomic coordinates and features needed for aligning RNA-Seq reads and counting gene-level expression. The reference genome files were organized into a dedicated subdirectory within the project folder.

2.6 Built Genome Index

Using the downloaded reference genome, we built a genome index to facilitate efficient read alignment. The indexing process prepared the genome for compatibility with the subread alignment tool, creating binary index files optimized for rapid read mapping. This step was critical for ensuring accurate and resource-efficient alignments.

2.7 Mapped Reads to the Genome

We mapped the processed reads to the mouse reference genome using subread-align, a tool known for its speed and accuracy. The alignment parameters included options for handling multi-mapped reads and specifying the RNA-Seq data type. We also adjusted the number of processing threads to optimize runtime based on system resources. The resulting BAM files stored the mapped read data, which would be used for downstream feature counting and differential expression analysis.

3 Count Reads Mapping to Features in R

3.1 Install Rsubread

BiocManager::install("Rsubread")

3.2 Set-up “Targets” (Sample Information) and Groups

Here we’ll use the targets.txt file that we set-up earlier to parametrize sample information in our analysis, and combine the factors of cell-type (Basal vs Luminal cells) and life history stage of the mice (virgin, pregnant, lactating) into a single one-way factor for analysis.

targets <- read.delim("targets.txt", header=TRUE)
targets
##                     FileName GEOAccession CellType   Status
## 1  SRR1552450.fastq.fastp.gz   GSM1480297        B   virgin
## 2  SRR1552451.fastq.fastp.gz   GSM1480298        B   virgin
## 3  SRR1552452.fastq.fastp.gz   GSM1480299        B pregnant
## 4  SRR1552453.fastq.fastp.gz   GSM1480300        B pregnant
## 5  SRR1552454.fastq.fastp.gz   GSM1480301        B  lactate
## 6  SRR1552455.fastq.fastp.gz   GSM1480302        B  lactate
## 7  SRR1552444.fastq.fastp.gz   GSM1480291        L   virgin
## 8  SRR1552445.fastq.fastp.gz   GSM1480292        L   virgin
## 9  SRR1552446.fastq.fastp.gz   GSM1480293        L pregnant
## 10 SRR1552447.fastq.fastp.gz   GSM1480294        L pregnant
## 11 SRR1552448.fastq.fastp.gz   GSM1480295        L  lactate
## 12 SRR1552449.fastq.fastp.gz   GSM1480296        L  lactate
group <- factor(paste0(targets$CellType, ".", targets$Status))

3.3 Count Mapped Reads

By default, Rsubread::featureCounts() uses genome feature annotations in its custom “SAF” format, so we need to tell it that the mouse external annotation is in “GTF” format.

output.files <- sub("^","GSE60450/",sub(".gz", ".subread-align.bam", targets$FileName))
library(Rsubread)
fc <- featureCounts(output.files,annot.ext = "Mus_musculus.GRCm39/Mus_musculus.GRCm39.113.gtf.gz",isGTFAnnotationFile = T)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.20.0
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 12 BAM files                                     ||
## ||                                                                            ||
## ||                           SRR1552450.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552451.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552452.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552453.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552454.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552455.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552444.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552445.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552446.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552447.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552448.fastq.fastp.subread-align.bam         ||
## ||                           SRR1552449.fastq.fastp.subread-align.bam         ||
## ||                                                                            ||
## ||              Paired-end : no                                               ||
## ||        Count read pairs : no                                               ||
## ||              Annotation : Mus_musculus.GRCm39.113.gtf.gz (GTF)             ||
## ||      Dir for temp files : .                                                ||
## ||                 Threads : 1                                                ||
## ||                   Level : meta-feature level                               ||
## ||      Multimapping reads : counted                                          ||
## || Multi-overlapping reads : not counted                                      ||
## ||   Min overlapping bases : 1                                                ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file Mus_musculus.GRCm39.113.gtf.gz ...                    ||
## ||    Features : 1298756                                                      ||
## ||    Meta-features : 78298                                                   ||
## ||    Chromosomes/contigs : 38                                                ||
## ||                                                                            ||
## || Process BAM file SRR1552450.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 295393                                               ||
## ||    Successfully assigned alignments : 235837 (79.8%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552451.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 296230                                               ||
## ||    Successfully assigned alignments : 236257 (79.8%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552452.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 295938                                               ||
## ||    Successfully assigned alignments : 231752 (78.3%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552453.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 294821                                               ||
## ||    Successfully assigned alignments : 231191 (78.4%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552454.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 296134                                               ||
## ||    Successfully assigned alignments : 236174 (79.8%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552455.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 294426                                               ||
## ||    Successfully assigned alignments : 234396 (79.6%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552444.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 295965                                               ||
## ||    Successfully assigned alignments : 229511 (77.5%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552445.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 294388                                               ||
## ||    Successfully assigned alignments : 226867 (77.1%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552446.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 294469                                               ||
## ||    Successfully assigned alignments : 231811 (78.7%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552447.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 296011                                               ||
## ||    Successfully assigned alignments : 234503 (79.2%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552448.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 296512                                               ||
## ||    Successfully assigned alignments : 242904 (81.9%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Process BAM file SRR1552449.fastq.fastp.subread-align.bam...               ||
## ||    Single-end reads are included.                                          ||
## ||    Total alignments : 296514                                               ||
## ||    Successfully assigned alignments : 244110 (82.3%)                       ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## || Write the final count table.                                               ||
## || Write the read assignment summary.                                         ||
## ||                                                                            ||
## \\============================================================================//
colnames(fc$counts) <- 1:12

4 Create DGEList for Analysis

The edgeR package is used for elementary functions in setting up the data object for analysis for both limma and edgeR. Make sure to install edgeR from BioConductor using BiocManager if it’s not already installed.

4.1 Create and Parametrize DGEList

library(edgeR)
y <- DGEList(fc$counts, group=group)
colnames(y) <- targets$GEO

4.2 Remap Ensemble Gene IDs to More Recognizable Gene Symbols

We will install the Mouse AnnotationHub package org.Mm.eg.db for remapping gene identifiers from BioConductor using BiocManager.

require(org.Mm.eg.db)
Symbol <- mapIds(org.Mm.eg.db, keys=rownames(y), keytype="ENSEMBL",column="SYMBOL")
Symbol[is.na(Symbol)] <- names(Symbol[is.na(Symbol)])
y$genes <- data.frame(Symbol=Symbol)

5 Filter Low-Expressed Genes

keep <- filterByExpr(y)
summary(keep)
##    Mode   FALSE    TRUE 
## logical   71711    6587
y <- y[keep, , keep.lib.sizes=FALSE]
y$samples
##                 group lib.size norm.factors
## GSM1480297   B.virgin   214393            1
## GSM1480298   B.virgin   215480            1
## GSM1480299 B.pregnant   212958            1
## GSM1480300 B.pregnant   213176            1
## GSM1480301  B.lactate   217806            1
## GSM1480302  B.lactate   215123            1
## GSM1480291   L.virgin   206660            1
## GSM1480292   L.virgin   203972            1
## GSM1480293 L.pregnant   214745            1
## GSM1480294 L.pregnant   218515            1
## GSM1480295  L.lactate   231758            1
## GSM1480296  L.lactate   232741            1

6 Normalize Count Data with TMM

There are at least three problems with the math in Robinson and Oshlack (2010)’s write-up of their highly competitive TMM normalization scale factor for RNA-Seq data. Unfortunately, these errors affect and obscore the definition of the central quantity of interest. Let’s review them:

  1. There is inconsistent notation between the main body of text and Materials and Methods, such as that the TMM scaling factor is written differently in the main text (\(f_k\)) and in the “Materials and Methods” where details are given (\(TMM^{(r)}_k\)), and different ways of symbolizing the reference sample in the two sections.

  2. The first critical math error is that \(M^r_{gk}\) is not correctly re-defined in the “Materials and Methods” section, althouugh it is correctly defined in the main body of the text as:

    • \(M_g = \log_2 \frac{Y_{gk}/N_k}{Y_{gk^{\prime}}/N_{k^{\prime}}}\) , but in the Materials and Methods it is incorrectly written as:

    • \(M^r_{gk} = \frac{\log_2 (Y_{gk}/N_k)}{\log_2(Y_{gr}/N_r)}\) — please note that the logarithm of a fraction is the not equivalent to the fraction of logarithms.

  3. The second critical math error is that the weight \(w^r_{gk}\) is defined as the “delta method”(Taylor-approximation)-based expression for the variance of the M-value, but these weights are described in the text as the reciprocal of the variance of the M-value.

Here is a corrected definition for the TMM normalization scaling factor. This definition has been validated based on what the edgeR::calcNormFactors() code actually does:

The TMM normalization scaling factor \(f^r_k\) of sample \(k\) relative to sample \(r\) (Robinson and Oshlack (2010)) and actual calculation of edgeR::calcNormFactors()is:

\[ f^r_k = 2^{(\sum_{g \in G^*} w^r_{gk}M^r_{gk}/\sum_{g \in G^*} w^r_{gk})} \]

where \(M^r_{gk} = \log_2 \frac{Y_{gk} N_r}{Y_{gr} N_k}\) and \(w^r_{gk} = \big( \frac{N_k - Y_{gk}}{N_k Y_{gk}} + \frac{N_r -Y_{gr}}{N_r Y_{gr}} \big)^{-1}\).

y.calc <- calcNormFactors(y) ## same as normLibSizes for this data
y.calc$samples$norm.factors
##  [1] 1.2324081 1.2237250 1.1616580 1.0839663 1.0465246 1.0521370 1.3678000
##  [8] 1.3508543 1.0572050 0.9853094 0.5019826 0.4949874

7 Visualize Normalized Counts with Mean-Difference Plots (AKA “MA Plots”)

sapply(1:12,function(x){plotMD(cpm(y.calc, log=TRUE), column=x);abline(h=0, col="red", lty=2, lwd=2)})

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
## 
## [[11]]
## NULL
## 
## [[12]]
## NULL

7.1 Exploratory Data Analysis: Visualize Multi-Dimensional Scaling (MDS) Plot of Samples

Limma’s MDSplot() function is implemented to project the similarities of samples in a low-dimensional projected space. This plot is useful to detect batch effects for example, in which case we would see samples clustering in batches rather than in sample condition groups.

Here is what the limma documentation says about its implementation:

This function is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the microarray context is used. The distance between each pair of samples (columns) is the root-mean-square deviation (Euclidean distance) for the top top genes. Distances on the plot can be interpreted as leading log2-fold-change, meaning the typical (root-mean-square) log2-fold-change between the samples for the genes that distinguish those samples. If gene.selection is “common”, then the top genes are those with the largest standard deviations between samples. If gene.selection is “pairwise”, then a different set of top genes is selected for each pair of samples. The pairwise feature selection may be appropriate for microarray data when different molecular pathways are relevant for distinguishing different pairs of samples.

Here, we use the default pairwise value of gene.selection.

points <- c(0,1,2,15,16,17)
colors <- rep(c("blue", "darkgreen", "red"), 2)
library(limma)
plotMDS(y, col=colors[group], pch=points[group])
legend("topleft", legend=levels(group), pch=points, col=colors, ncol=2)

8 Analyze Differentially Expressed Genes (DEGs) in Undersampled Data

8.1 Create Design Matrix

Here we use the “group-means” parametrization for a one-way layout; we can use the contrast matrix to define contrasts and interaction effects for the two main factors of cell-type and life history stage of the mice.

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
design
##    B.lactate B.pregnant B.virgin L.lactate L.pregnant L.virgin
## 1          0          0        1         0          0        0
## 2          0          0        1         0          0        0
## 3          0          1        0         0          0        0
## 4          0          1        0         0          0        0
## 5          1          0        0         0          0        0
## 6          1          0        0         0          0        0
## 7          0          0        0         0          0        1
## 8          0          0        0         0          0        1
## 9          0          0        0         0          1        0
## 10         0          0        0         0          1        0
## 11         0          0        0         1          0        0
## 12         0          0        0         1          0        0
## attr(,"assign")
## [1] 1 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

8.2 Define Contrasts of Interest

Here we define contrasts between subsequent life-history stages in mice for each cell type and interaction effects comparing each pair of stages across cell-types.

contrast.matrix <- makeContrasts(
     Lact_vs_Preg_in_B=B.lactate-B.pregnant,
     Lact_vs_Preg_in_L=L.lactate-L.pregnant,     
     Diff_LP=(L.lactate-L.pregnant)-(B.lactate-B.pregnant),
     Preg_vs_Virg_in_B=B.pregnant-B.virgin,
     Preg_vs_Virg_in_L=L.pregnant-L.virgin,     
     Diff_PV=(L.pregnant-L.virgin)-(B.pregnant-B.virgin),
     levels=design)

8.3 Compute DEGs with limma-voom

voom.fit <- voom(y.calc,design,plot=TRUE)

limma.lm.fit <- lmFit(voom.fit, design)
limma.contrasts.fit <- contrasts.fit(limma.lm.fit, contrast.matrix)
limma.ebayes.fit    <- eBayes(limma.contrasts.fit,robust=T)

8.3.1 Sortable and Searchable Top Tables of limma-voom DEGs for Different Contrasts

We will need to install the DT package for the datatable function useful to make an sortable and searchable table of results of top DEGs in your knitted output.

By default limma::topTable sorts genes by their \(B\)-value, the posterior log-odds of differential expression.

For example, in the following code-chunk we generate a top-table of the top 100 genes for the contrast of lactating vs pregnant mice in luminal cells features in Figure 6b of Fu et al. (2015), defined as in the contrast matrix so that large B-values correspond to excess expression in the Luminal cells of lactating mice (alternatively sorting by \(p\)-value would give you both overexpressed and underexpressed genes in this contrast). This is the second contrast defined in the contrast-matrix, so it corresponds to the second coefficient for the call to topTable().

limma.tt <- topTable(limma.ebayes.fit, coef = 2, number = 100, sort.by = "B") # M == LogFC
library(DT)
datatable(limma.tt)

We can “sanity-check” our analysis by comparing results to the paper. Below is Figure 6B, and its corresponding caption from the paper.

knitr::include_graphics("Fig6b_FuEtAl15.png")
Figure 6b from Fu *et al.* (2015)

Figure 6b from Fu et al. (2015)

  1. Heat map of gene expression for Mcl-1 and cytokines/growth factors that are differentially expressed between late pregnancy and early lactation for the luminal population.

For this contrast, we are interested in genes listed with a strong positive LFC in Luminal (L) cells in lactating mice compared to pregnant mice, which have red signals in the heatmap on the lower half of the leftmost column of this figure.

The gene with the sixth-highest B-value is Spp1 which is shown in the figure. Some other genes shown in the figure are not present in our top 100 genes because they were filtered as low-expression genes in our undersampled reads. But other genes shown in the figure such as Cxcl1 and Cxlcl2 are in these top-100 genes.

8.4 Compute DEGs with edgeR

8.4.1 Estimate and Visualize Dispersion

edgeR.estimateDisp <- estimateDisp(y.calc, design, robust=TRUE)
plotBCV(edgeR.estimateDisp)

8.4.2 Fit Negative Binomial Model and Compute Estimated LogFC for each Gene and Contrast

The second contrast, that we looked at previously with limma-voom, compares Luminal Cells in lactating vs. pregnant mice. Here we pass in the entire contrast matrix for glmQLFTest and the corresponding \(p\)-values obtained are from a likelihood-ratio test over all contrasts.

edgeR.glmQLF.fit <- glmQLFit(edgeR.estimateDisp, design, robust=TRUE)
plotQLDisp(edgeR.glmQLF.fit)

edgeR.glmQLF.estimate <- glmQLFTest(edgeR.glmQLF.fit, contrast = contrast.matrix)

8.4.3 Sortable and Searchable Top Table of edgeR DEGs for All Contrasts

The edgeR QL-pipeline-equivalent of topTable is topTags but it works differently: it sorts genes by default by their \(p\)-value and if we sort by logFC it sorts by absolute value of logFC, and when passing in the entire contrast matrix it computes \(p\)-values from a likelihood-ratio test over all contrasts. Note that edgeR::topTags also returns a different data structure than limma::topTable.

To make a topTable more comparable to the one we examined before from limma, we can pass in only one specific column from the contrast matrix to glmQLFTest like so:

edgeR.glmQLF.estimate.coef2 <- glmQLFTest(edgeR.glmQLF.fit, contrast=contrast.matrix[,2])
edgeR.tt <- topTags(edgeR.glmQLF.estimate.coef2, n = 100) # by default sorted by Pvalue
datatable(edgeR.tt$table)

We should still be able to find results for the genes that we looked at before in this table, like Spp1 and Cxcl2.

8.5 Compare Numbers of DEGs and Distributions of \(p\)-values Obtained in Undersampled Data Between edgeR and limma-voom

Let’s compare which method yields more DEGs from our undersampled data for contrast 2 (Luminal cells in Lactating vs Pregnant mice). In order to get p-values for this one contrast, we have to re-run glmQLFTest with just the specific column of the contrast matrix we are interested in (see here for more information).

limma.tt <- topTable(limma.ebayes.fit, coef = 2, number = Inf)
edgeR.tt <- topTags(edgeR.glmQLF.estimate.coef2, n = Inf)

Here is a table of number of DEGs/NDEGs for all contrasts from limma for a Benjamini-Hochberg FDR of 10%:

summary(decideTests(limma.ebayes.fit,p.value=0.1))
##        Lact_vs_Preg_in_B Lact_vs_Preg_in_L Diff_LP Preg_vs_Virg_in_B
## Down                 307               576     314               259
## NotSig              5860              5219    5849              5897
## Up                   420               792     424               431
##        Preg_vs_Virg_in_L Diff_PV
## Down                 615     424
## NotSig              5416    5927
## Up                   556     236

Here is the result for just contrast 2 using edgeR.

summary(decideTests(edgeR.glmQLF.estimate.coef2,p.value=0.1))
##        1*L.lactate -1*L.pregnant
## Down                         577
## NotSig                      5275
## Up                           735

Limma finds more DEGs in our undersampled data for this contrast. If we compare the histogram of \(p\)-values we get a clue as to why:

hist(limma.tt$P.Value)

hist(edgeR.tt$table$PValue) 

There appears to be an excess of large \(p\)-values near 1 for edgeR; this indicates that perhaps the nonparametric mean-variance trend modeling of limma may better compensate for under-sampling in this data and not enough data is available to parametrize the distributional mean-variance trend modeling of edgeR.

9 Compare Numbers of DEGs and Distributions of \(p\)-values Obtained in Full Counts Data Between edgeR and limma-voom

Fu et al. (2015) used limma-voom to analyze their data. Let’s see if edgeR performs better than limma when analyzing the full data-set of counts that the authors provide on GEO. The results shown here should also be more comparable to the output shown in the edgeR UserGuide Section 4.4 for this data.

First we download and then read in the counts, and then we follow the same analytical workflow documented above.

9.1 Read in Full Counts Data Matrix

fullcounts <- read.delim("GSE60450/GSE60450_Lactation-GenewiseCounts.txt")

9.2 Rename Columns of Counts Data

The names of columns here need to be mapped back to the GSM identifiers to make the data comparable to what we have analyzed from our under-sampled read-mapped data. The identifiers may be mapped by processing the file “GSE60450_series_matrix.txt.gz” obtainable when we click the link “Series Matrix File(s)” from the GEO record. We have done this and renamed the data columns in the code chunk below.

sample_names <- c("MCL1.LA_BC2CTUACXX_GATCAG_L001_R1","MCL1.LB_BC2CTUACXX_TGACCA_L001_R1","MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1","MCL1.LD_BC2CTUACXX_GGCTAC_L001_R1","MCL1.LE_BC2CTUACXX_TAGCTT_L001_R1","MCL1.LF_BC2CTUACXX_CTTGTA_L001_R1","MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1","MCL1.DH_BC2CTUACXX_CAGATC_L002_R1","MCL1.DI_BC2CTUACXX_ACAGTG_L002_R1","MCL1.DJ_BC2CTUACXX_CGATGT_L002_R1","MCL1.DK_BC2CTUACXX_TTAGGC_L002_R1","MCL1.DL_BC2CTUACXX_ATCACG_L002_R1")

ref_ids <- c("GSM1480291","GSM1480292","GSM1480293","GSM1480294","GSM1480295","GSM1480296","GSM1480297","GSM1480298","GSM1480299","GSM1480300","GSM1480301","GSM1480302")

names(ref_ids) <- sample_names

names(fullcounts)[3:14] <- ref_ids[names(fullcounts)[3:14]]

9.3 Set up DGElist, Filter and Normalize Data

y.full <- DGEList(fullcounts[,3:14], group=group)
library(org.Mm.eg.db)
Symbol <- mapIds(org.Mm.eg.db, keys=as.character(fullcounts$EntrezGeneID), keytype="ENTREZID",column="SYMBOL")
Symbol[is.na(Symbol)] <- names(Symbol[is.na(Symbol)])
y.full$genes <- data.frame(Symbol=Symbol)
keep <- filterByExpr(y.full)
y.full <- y.full[keep, , keep.lib.sizes=FALSE]
y.full$samples
##                 group lib.size norm.factors
## GSM1480297   B.virgin 23219195            1
## GSM1480298   B.virgin 21769326            1
## GSM1480299 B.pregnant 24092719            1
## GSM1480300 B.pregnant 22657703            1
## GSM1480301  B.lactate 21522881            1
## GSM1480302  B.lactate 20009184            1
## GSM1480291   L.virgin 20385437            1
## GSM1480292   L.virgin 21699830            1
## GSM1480293 L.pregnant 22236469            1
## GSM1480294 L.pregnant 21983364            1
## GSM1480295  L.lactate 24720123            1
## GSM1480296  L.lactate 24653390            1
y.fullnorm <- calcNormFactors(y.full) ## same as normLibSizes for this data

9.4 Visualize Normalized Data and MDS Plots

sapply(1:2,function(x){plotMD(cpm(y.fullnorm, log=TRUE), column=x);abline(h=0, col="red", lty=2, lwd=2)})

## [[1]]
## NULL
## 
## [[2]]
## NULL
points <- c(0,1,2,15,16,17)
colors <- rep(c("blue", "darkgreen", "red"), 2)
plotMDS(y.fullnorm, col=colors[group], pch=points[group])
legend("topleft", legend=levels(group), pch=points, col=colors, ncol=2)

9.5 Compute DEGs with Limma

Here we use limma without the option robust = T since there is so much data

voom.fit.full <- voom(y.fullnorm,design,plot=TRUE)

limma.lm.fit.full <- lmFit(voom.fit.full, design)
limma.contrasts.fit.full <- contrasts.fit(limma.lm.fit.full, contrast.matrix)
limma.ebayes.fit.full    <- eBayes(limma.contrasts.fit.full)
limma.tt.full <- topTable(limma.ebayes.fit.full, coef = 2, number = Inf)

9.6 Compute DEGs with edgeR QL

edgeR.estimateDisp.full <- estimateDisp(y.fullnorm, design)
plotBCV(edgeR.estimateDisp.full)

edgeR.glmQLF.fit.full <- glmQLFit(edgeR.estimateDisp.full, design)
plotQLDisp(edgeR.glmQLF.fit.full)

edgeR.glmQLF.estimate.coef2.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,2])
edgeR.tt.full <- topTags(edgeR.glmQLF.estimate.coef2.full, n = Inf) # by default sorted by Pvalue

9.7 Compare number of DEGs and Distributions of \(p\)-values for Full Counts Data Between EdgeR and Limma

summary(decideTests(limma.ebayes.fit.full,p.value=0.1))
##        Lact_vs_Preg_in_B Lact_vs_Preg_in_L Diff_LP Preg_vs_Virg_in_B
## Down                3388              4238    3292              3065
## NotSig              9225              7471    9190              9577
## Up                  3356              4260    3487              3327
##        Preg_vs_Virg_in_L Diff_PV
## Down                4452    3327
## NotSig              7416    9790
## Up                  4101    2852

Here is the result for just contrast 2 using edgeR:

summary(decideTests(edgeR.glmQLF.estimate.coef2.full,p.value=0.1))
##        1*L.lactate -1*L.pregnant
## Down                        4377
## NotSig                      7295
## Up                          4297

In contrast with the undersampled data, with the full data, now EdgeR finds more DEGs. The histograms of \(p\)-values show that edgeR still has an excees of \(p\)-values near 1 compared to limma, but the effect is relatively smaller.

hist(limma.tt.full$P.Value)

hist(edgeR.tt.full$table$PValue) 

10 Increasing DEGs Using Local fdrs

Let’s proceed with the full dataset and the EdgeR estimates, and apply what we have learned about FDRs to increase the number of DEGs for each contrast.

First let’s get \(p\)-values for all genes for all six of our contrasts.

edgeR.glmQLF.estimate.coef1.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,1])
edgeR.glmQLF.estimate.coef2.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,2])
edgeR.glmQLF.estimate.coef3.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,3])
edgeR.glmQLF.estimate.coef4.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,4])
edgeR.glmQLF.estimate.coef5.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,5])
edgeR.glmQLF.estimate.coef6.full <- glmQLFTest(edgeR.glmQLF.fit.full, contrast=contrast.matrix[,6])

Next, let’s extract the full topTag tables with the \(p\)-values.

edgeR.tt.full.coef1 <- topTags(edgeR.glmQLF.estimate.coef1.full, n = Inf)
edgeR.tt.full.coef2 <- topTags(edgeR.glmQLF.estimate.coef2.full, n = Inf)
edgeR.tt.full.coef3 <- topTags(edgeR.glmQLF.estimate.coef3.full, n = Inf)
edgeR.tt.full.coef4 <- topTags(edgeR.glmQLF.estimate.coef4.full, n = Inf)
edgeR.tt.full.coef5 <- topTags(edgeR.glmQLF.estimate.coef5.full, n = Inf)
edgeR.tt.full.coef6 <- topTags(edgeR.glmQLF.estimate.coef6.full, n = Inf)

Then let’s estimate \(\pi_0\), the fraction of NDEGs, for each contrast.

library(FDRestimation)
pi0.coef1 <- get.pi0(edgeR.tt.full.coef1$table$PValue)
pi0.coef1
## [1] 0.4721648
pi0.coef2 <- get.pi0(edgeR.tt.full.coef2$table$PValue)
pi0.coef2
## [1] 0.5122425
pi0.coef3 <- get.pi0(edgeR.tt.full.coef3$table$PValue)
pi0.coef3
## [1] 0.6838249
pi0.coef4 <- get.pi0(edgeR.tt.full.coef4$table$PValue)
pi0.coef4
## [1] 0.4709124
pi0.coef5 <- get.pi0(edgeR.tt.full.coef5$table$PValue)
pi0.coef5
## [1] 0.3707183
pi0.coef6 <- get.pi0(edgeR.tt.full.coef6$table$PValue)
pi0.coef6
## [1] 0.5936502

As a reminder, our contrasts are in the order “Lact_vs_Preg_in_B Lact_vs_Preg_in_L Diff_LP Preg_vs_Virg_in_B Preg_vs_Virg_in_L Diff_PV” so as expected, the two interaction effects (differences across cell-types for the two differences in life history stages) have the highest estimated proportion of NDEGs.

Now let’s estimate the fdrs:

p.fdr.obj.coef1 <- p.fdr(p=edgeR.tt.full.coef1$table$PValue,set.pi0 = pi0.coef1)
p.fdr.obj.coef2 <- p.fdr(p=edgeR.tt.full.coef2$table$PValue,set.pi0 = pi0.coef2)
p.fdr.obj.coef3 <- p.fdr(p=edgeR.tt.full.coef3$table$PValue,set.pi0 = pi0.coef3)
p.fdr.obj.coef4 <- p.fdr(p=edgeR.tt.full.coef4$table$PValue,set.pi0 = pi0.coef4)
p.fdr.obj.coef5 <- p.fdr(p=edgeR.tt.full.coef5$table$PValue,set.pi0 = pi0.coef5)
p.fdr.obj.coef6 <- p.fdr(p=edgeR.tt.full.coef6$table$PValue,set.pi0 = pi0.coef6)
p.fdr.obj.coef1 <- p.fdr(p=edgeR.tt.full.coef1$table$PValue,set.pi0 = pi0.coef1)
p.fdr.obj.coef2 <- p.fdr(p=edgeR.tt.full.coef2$table$PValue,set.pi0 = pi0.coef2)
p.fdr.obj.coef3 <- p.fdr(p=edgeR.tt.full.coef3$table$PValue,set.pi0 = pi0.coef3)
p.fdr.obj.coef4 <- p.fdr(p=edgeR.tt.full.coef4$table$PValue,set.pi0 = pi0.coef4)
p.fdr.obj.coef5 <- p.fdr(p=edgeR.tt.full.coef5$table$PValue,set.pi0 = pi0.coef5)
p.fdr.obj.coef6 <- p.fdr(p=edgeR.tt.full.coef6$table$PValue,set.pi0 = pi0.coef6)

To compare how many more DEGs we get with this method, let’s start first with DEG statistics using edgeR’s own Benjamiini-Hochberg FDRs and an FDR cut-off of 1% (which is also what Fu et al used):

summary(decideTests(edgeR.glmQLF.estimate.coef1.full,p.value=0.01))
##        1*B.lactate -1*B.pregnant
## Down                        1641
## NotSig                     12916
## Up                          1412
summary(decideTests(edgeR.glmQLF.estimate.coef2.full,p.value=0.01))
##        1*L.lactate -1*L.pregnant
## Down                        2820
## NotSig                     10367
## Up                          2782
summary(decideTests(edgeR.glmQLF.estimate.coef3.full,p.value=0.01))
##        -1*B.lactate 1*B.pregnant 1*L.lactate -1*L.pregnant
## Down                                                  1820
## NotSig                                               12262
## Up                                                    1887
summary(decideTests(edgeR.glmQLF.estimate.coef1.full,p.value=0.01))
##        1*B.lactate -1*B.pregnant
## Down                        1641
## NotSig                     12916
## Up                          1412
summary(decideTests(edgeR.glmQLF.estimate.coef2.full,p.value=0.01))
##        1*L.lactate -1*L.pregnant
## Down                        2820
## NotSig                     10367
## Up                          2782
summary(decideTests(edgeR.glmQLF.estimate.coef3.full,p.value=0.01))
##        -1*B.lactate 1*B.pregnant 1*L.lactate -1*L.pregnant
## Down                                                  1820
## NotSig                                               12262
## Up                                                    1887
summary(p.fdr.obj.coef1$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical   11921    4048
summary(p.fdr.obj.coef2$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical    9585    6384
summary(p.fdr.obj.coef3$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical   11894    4075
summary(p.fdr.obj.coef4$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical   12131    3838
summary(p.fdr.obj.coef5$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical    9407    6562
summary(p.fdr.obj.coef6$fdrs <= 0.01)
##    Mode   FALSE    TRUE 
## logical   12548    3421

The numbers of DEGs increases by about 50%.

11 Visualizing DEGs Using Wordclouds

Let’s use Wordclouds to visualize DEGs for the edgeR fit for the full data counts and the local fdrs less than or equal to 1%.

11.1 Coefficient 1: Overexpressed in B cells in Lactating vs Pregnant Mice

Please note that the following shows only DEGs that are overexpressed in the B cells of Lactating Mice relative to Pregnant Mice. If we separately wanted to visualize DEGs overexpressed in pregnant mice, we would need to filter for negative log-fold-changes.

library(wordcloud2)
termdata <- edgeR.tt.full.coef1$table$Symbol[p.fdr.obj.coef1$fdrs <= 0.01 & edgeR.tt.full.coef1$table$logFC > 0]
freqdata <- edgeR.tt.full.coef1$table$logFC[p.fdr.obj.coef1$fdrs <= 0.01 & edgeR.tt.full.coef1$table$logFC > 0]

termdataFC <- edgeR.tt.full.coef1$table$logFC[p.fdr.obj.coef1$fdrs <= 0.01 & edgeR.tt.full.coef1$table$logFC > 0]
termdata <- termdata[order(termdataFC,decreasing=TRUE)]
freqdata <- freqdata[order(termdataFC,decreasing=TRUE)]

# limit to top 500 terms
termdata <- termdata[1:500]
freqdata <- freqdata[1:500]

# add a title to wordcloud2, from
# https://stackoverflow.com/questions/66957909/add-a-title-and-remove-whitespace-from-wordcloud2
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 1))
par(mar = rep(0, 4))
plot.new()
text(
  x = 0.5,
  y = 0.5,
  "Overexpressed in B cells in Lactating vs Pregnant Mice",
  cex = 1.5,
  font = 2
)

wordcloud2(data = data.frame(word=termdata,freq=freqdata),size=0.2)

11.2 Coefficient 2: Overexpressed in L cells in Lactating vs Pregnant Mice

library(wordcloud2)
termdata <- edgeR.tt.full.coef2$table$Symbol[p.fdr.obj.coef2$fdrs <= 0.01 & edgeR.tt.full.coef2$table$logFC > 0]
freqdata <- edgeR.tt.full.coef2$table$logFC[p.fdr.obj.coef2$fdrs <= 0.01 & edgeR.tt.full.coef2$table$logFC > 0]

termdataFC <- edgeR.tt.full.coef2$table$logFC[p.fdr.obj.coef2$fdrs <= 0.01 & edgeR.tt.full.coef2$table$logFC > 0]
termdata <- termdata[order(termdataFC,decreasing=TRUE)]
freqdata <- freqdata[order(termdataFC,decreasing=TRUE)]

# limit to top 500 terms
termdata <- termdata[1:500]
freqdata <- freqdata[1:500]

# add a title to wordcloud2, from
# https://stackoverflow.com/questions/66957909/add-a-title-and-remove-whitespace-from-wordcloud2
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 1))
par(mar = rep(0, 4))
plot.new()
text(
  x = 0.5,
  y = 0.5,
  "Overexpressed in L cells in Lactating vs Pregnant Mice",
  cex = 1.5,
  font = 2
)

wordcloud2(data = data.frame(word=termdata,freq=freqdata),size=0.2)

11.3 Coefficient 3: Excess Change in L cells Relative to B cells in Lactating vs Pregnant Mice

library(wordcloud2)
termdata <- edgeR.tt.full.coef3$table$Symbol[p.fdr.obj.coef3$fdrs <= 0.01 & edgeR.tt.full.coef3$table$logFC > 0]
freqdata <- edgeR.tt.full.coef3$table$logFC[p.fdr.obj.coef3$fdrs <= 0.01 & edgeR.tt.full.coef3$table$logFC > 0]

termdataFC <- edgeR.tt.full.coef3$table$logFC[p.fdr.obj.coef3$fdrs <= 0.01 & edgeR.tt.full.coef3$table$logFC > 0]
termdata <- termdata[order(termdataFC,decreasing=TRUE)]
freqdata <- freqdata[order(termdataFC,decreasing=TRUE)]

# limit to top 500 terms
termdata <- termdata[1:500]
freqdata <- freqdata[1:500]

# add a title to wordcloud2, from
# https://stackoverflow.com/questions/66957909/add-a-title-and-remove-whitespace-from-wordcloud2
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 1))
par(mar = rep(0, 4))
plot.new()
text(
  x = 0.5,
  y = 0.5,
  "Excess Overexpression in L cells Relative to B cells in Lactating vs Pregnant Mice",
  cex = 1.5,
  font = 2
)

wordcloud2(data = data.frame(word=termdata,freq=freqdata),size=0.2)